Reproductive toxicology. Trichloroethylene.

Authors

  • Stefan Wolfsheimer
  • Bernd Burghardt
  • Alexander K Hartmann

Abstract

Background: The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case where gaps are allowed. For this case, the distribution is only known empirically in the high-probability region, which is biologically less relevant.

Results: We provide a method to obtain numerically the biologically relevant rare-event tail of the distribution. The method, which has been outlined in an earlier work, is based on generating the sequences with a parametrized probability distribution, which is biased with respect to the original biological one, in the framework of Metropolis Coupled Markov Chain Monte Carlo. Here, we first present the approach in detail and evaluate the convergence of the algorithm by considering a simple test case. In the earlier work, the method was applied to a single example case only. Therefore, we consider here a large set of parameters: we study the distributions for protein alignment with different substitution matrices (BLOSUM62 and PAM250) and affine gap costs with different parameter values. In the logarithmic phase (large gap costs) it was previously assumed that the Gumbel form still holds, hence the Gumbel distribution is usually used when evaluating p-values in databases. Here we show that for all cases, provided that the sequences are not too long (L > 400), a "modified" Gumbel distribution, i.e. a Gumbel distribution with an additional Gaussian factor, is suitable to describe the data. We also provide a "scaling analysis" of the parameters used in the modified Gumbel distribution. Furthermore, via a comparison with BLAST parameters, we show that significance estimations change considerably when using the true distributions as presented here. Finally, we also study the distribution of the sum statistics of the k best alignments.
Conclusion: Our results show that the statistics of gapped and ungapped local alignments deviate significantly from Gumbel in the rare-event tail. We provide a Gaussian correction to the distribution and an analysis of its scaling behavior for several different scoring parameter sets, which are commonly used to search protein databases. The case of sum statistics of k best alignments is included.

Published: 11 July 2007. Algorithms for Molecular Biology 2007, 2:9. doi:10.1186/1748-7188-2-9
Received: 5 October 2006. Accepted: 11 July 2007.
This article is available from: http://www.almob.org/content/2/1/9
© 2007 Wolfsheimer et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Sequence alignment is a powerful tool in bioinformatics [1,2] to detect evolutionarily related proteins by comparing their sequences of amino acids. Basically one wants to determine the "similarity" of the sequences. For example, given a protein in a database like PDB [3], such similarity analysis can be used to detect other proteins that are evolutionarily close to it. Related approaches are also used for the comparison of DNA sequences, e.g. shotgun DNA sequencing [4], but the application to DNA is not considered in this article. Alignment algorithms find optimum alignments and maximum alignment scores S of two or more sequences for a given scoring system. Needleman and Wunsch suggested a method to compute global alignments [5], whereas the Smith-Waterman algorithm [6] aims at finding local similarities. Insertions and deletions of residues are taken into account by allowing for gaps in the alignment.
Gaps yield a negative contribution to the alignment score and are usually modeled by a score function g(l) that depends on the gap length l. Affine gap costs are widely used because, for two given sequences of lengths L and M, fast algorithms with running time $\mathcal{O}(LM)$ are available for this case [7]. Note that for database queries even this is too costly, hence fast heuristics like BLAST [8] are used there.

By itself, the alignment score, which measures the similarity of two given sequences, does not contain any information about the statistical significance of an alignment. One approach to quantify the statistical significance is to compute the p-value for a given score S. This means that, under a random sequence model, one wants to know the probability of at least one hit with a score S greater than or equal to some given threshold value b, i.e. $\Pr(S \ge b)$. Often E-values are used instead. They describe the expected number of hits with a score greater than or equal to some threshold value. One way to assess the statistical significance is to use the null model of random sequences. Then the optimal alignment score S becomes a random variable, and its probability of occurrence under this model, $P(s) = \Pr(S = s)$, provides estimates for p-values. Analytic expressions for P(s) are only known asymptotically in the case of gapless alignments of long sequences, where an extreme value distribution (also called Gumbel distribution) [9,10] was found. For alignments with gaps, such analytical expressions are not available. Approximations for scenarios with gaps based on probabilistic alignment [11-13], large deviations [14] and a Poisson model [15] have been developed. Altschul and Gish [16] investigated the score statistics of random sequences for a number of scoring systems and gap parameters by computer simulations: they obtained histograms of optimum scores for randomly sampled pairs of sequences by simple sampling.
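The simple-sampling idea described above can be illustrated with a short Python sketch. It is not the authors' code: the match/mismatch score `toy_score`, the uniform letter frequencies and all parameter values are hypothetical stand-ins (the studies cited use substitution matrices such as BLOSUM62). The dynamic program is a standard Gotoh-style recursion for affine gap costs g(l) = alpha + beta (l - 1), which runs in time proportional to LM.

```python
import random

def toy_score(a, b):
    # Hypothetical stand-in for a substitution matrix such as BLOSUM62.
    return 1 if a == b else -1

def local_affine_score(x, y, score, alpha, beta):
    """Optimum local alignment score with gap cost g(l) = alpha + beta*(l-1).

    Gotoh-style dynamic programming, O(len(x) * len(y)) time.
    """
    neg_inf = float("-inf")
    m = len(y)
    h_prev = [0] * (m + 1)        # best score ending at cell (i-1, j)
    f_prev = [neg_inf] * (m + 1)  # best score ending in a gap that skips letters of x
    best = 0
    for i in range(1, len(x) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        e = neg_inf               # best score ending in a gap that skips letters of y
        for j in range(1, m + 1):
            e = max(h_cur[j - 1] - alpha, e - beta)
            f_cur[j] = max(h_prev[j] - alpha, f_prev[j] - beta)
            diag = h_prev[j - 1] + score(x[i - 1], y[j - 1])
            h_cur[j] = max(0, diag, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best

def sample_score_histogram(n_samples, L, M, alphabet, score, alpha, beta, rng):
    """Simple sampling: normalized histogram of optimum scores of random pairs."""
    counts = {}
    for _ in range(n_samples):
        x = rng.choices(alphabet, k=L)  # i.i.d. letters, uniform frequencies (assumption)
        y = rng.choices(alphabet, k=M)
        s = local_affine_score(x, y, score, alpha, beta)
        counts[s] = counts.get(s, 0) + 1
    return {s: c / n_samples for s, c in sorted(counts.items())}
```

Calling, for example, `sample_score_histogram(1000, 50, 50, "ACGT", toy_score, 3, 1, random.Random(0))` gives an estimate of P(s) that is only reliable where P(s) is well above 1/n_samples, which is exactly the limitation of simple sampling discussed in the text.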
By curve fitting, they showed that in the region of high probability the extreme value distribution describes the data well, also for gapped alignments of finite sequences. Additionally, they found that the theoretical predictions for the relation between the scoring system on one side and the Gumbel parameters on the other side hold approximately for gapped alignments. In this context they obtained two improvements: using a correction to account for finite sequence lengths and the sum statistics of the k best alignments, theoretical predictions for ungapped alignments could be applied more accurately to gapped alignments. Recently Olsen et al. introduced the "island method" [17,18], which reduces sampling time. BLAST [8] uses precomputed data, generated with the island method, to estimate E-values.

In any case, as already pointed out, the studies in Refs. [16] and [18] give reliable data only in the region where P(s) is large. This is outside the region of biological interest because pairs of biologically related sequences have a higher similarity than pairs of purely randomly drawn sequences. To overcome this drawback a rare-event sampling technique was proposed recently [19], which is based on methods from statistical physics. This general approach makes it possible to obtain the distribution over a wide range, in the present case down to $P(s) = 10^{-40}$. So far this method has been applied to one relevant case only, namely protein alignment with the BLOSUM62 score matrix [7] and affine gap costs with α = 12 opening and β = 1 extension costs. It turned out that, at least for one scoring matrix and one set of gap-cost parameters, the distribution deviates from the Gumbel form in the biologically relevant rare-event tail, where simple sampling methods fail. Empirically, a Gaussian correction to the original distribution was proposed for this case. Results as in Ref.
[19] are only useful if one obtains the distribution for a large range of the parameter values that are commonly used in bioinformatics. It is the purpose of this work to study the distribution of S for other relevant cases. Here we consider the BLOSUM62 and the PAM250 score matrices in connection with various parameters α, β of affine gap costs.

The paper is organized as follows. In the second section we define alignments formally and state a few main results on the statistics of local sequence alignment. Next, we state the rare-event approach used here, and in the fourth section we explain our approach in detail. We introduce some toy examples which are also used to evaluate the convergence properties of the algorithm. In the fifth section, we present our results for BLOSUM62 and PAM250 matrices in conjunction with different affine gap costs. We also show our results for the sum statistics of the k largest alignments. In the last section, we summarize and discuss our results.

Statistics of local sequence alignment

In this section, we define sequence alignment and state some analytical results for the distribution of the optimum scores S over pairs of random sequences. Let $x = x_1 x_2 \ldots x_L$ and $y = y_1 y_2 \ldots y_M$ be two sequences over a finite alphabet Σ with r = |Σ| letters (e.g. nucleic acids or amino acids). An alignment is a set $\mathcal{A} = \{(i_k, j_k)\}$ of K pairs of "non-crossing" indices ($k = 1, 2, \ldots, K-1$: $1 \le i_k < i_{k+1} \le L$ and $1 \le j_k < j_{k+1} \le M$), identifying pairs of letters from the two sequences. Letters that are not paired are called unpaired or gapped. A gap g of length $l_g$ is a substring of $l_g$ gapped letters from one sequence. Note that this representation [14] of an alignment is equivalent to the introduction of a gap symbol, as commonly used.
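To make the index-pair representation concrete, here is a minimal Python sketch. The function names are hypothetical and chosen for illustration only. It checks the non-crossing condition, lists the lengths of the runs of unpaired letters between consecutive pairs, and renders the equivalent gap-symbol form.

```python
def is_valid_alignment(pairs, L, M):
    """Check that 1-based index pairs (i_k, j_k) are in range and non-crossing."""
    for k, (i, j) in enumerate(pairs):
        if not (1 <= i <= L and 1 <= j <= M):
            return False
        if k > 0:
            i_prev, j_prev = pairs[k - 1]
            if not (i_prev < i and j_prev < j):
                return False
    return True

def gap_lengths(pairs, which):
    """Lengths of runs of unpaired letters strictly between consecutive pairs.

    which=0 refers to sequence x, which=1 to sequence y.
    Unpaired letters before the first or after the last pair are not counted.
    """
    lengths = []
    for a, b in zip(pairs, pairs[1:]):
        length = b[which] - a[which] - 1
        if length > 0:
            lengths.append(length)
    return lengths

def with_gap_symbols(x, y, pairs, gap="-"):
    """Render the paired region of two strings using an explicit gap symbol."""
    top, bottom = [], []
    for k, (i, j) in enumerate(pairs):
        if k > 0:
            i_prev, j_prev = pairs[k - 1]
            skipped_x = x[i_prev:i - 1]  # letters i_prev+1 .. i-1 (1-based)
            skipped_y = y[j_prev:j - 1]
            top.append(skipped_x + gap * len(skipped_y))
            bottom.append(gap * len(skipped_x) + skipped_y)
        top.append(x[i - 1])
        bottom.append(y[j - 1])
    return "".join(top), "".join(bottom)
```

For x = "ACGT", y = "AGT" and the pairs [(1, 1), (3, 2), (4, 3)], the letter C of x is unpaired, so the gap-symbol form is ACGT over A-GT.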
Formally, the gap cost function can be defined by considering the length of a gap beginning at the kth pairing in sequence x or sequence y, respectively:

$$l^x_k = i_{k+1} - i_k - 1, \qquad l^y_k = j_{k+1} - j_k - 1. \tag{1}$$

The score $S(x, y, \mathcal{A})$ of the local alignment of the two sequences is composed of a sum over all aligned pairs and a sum over all gaps of both sequences:

$$S(x, y, \mathcal{A}) = \sum_{k=1}^{K} \sigma(x_{i_k}, y_{j_k}) - \sum_{k=1}^{K-1} \left[ g(l^x_k) + g(l^y_k) \right], \tag{2}$$

where $\sigma(a, b)$, with $a, b \in \Sigma$, is the given score matrix (or substitution matrix) and g(l) the gap-cost function with g(0) = 0. Note that the alignment is local, because the (possibly large) gaps at the beginning and the end of each sequence are not included in the scoring function. Otherwise the alignment would be global. Here, we consider the BLOSUM62 [20] and the PAM250 [21,22] matrices and affine gap costs, i.e. g(l) = α + β(l - 1) for l ≥ 1. The similarity of the sequences is given by the optimum alignment, i.e. the one with the maximum score, which can be obtained in $\mathcal{O}(LM)$ time [7].

In the case of gapless optimum local alignments of two random sequences of L and M independent letters from Σ with frequencies $\{f_a\}$, $a \in \Sigma$, and $\sum_a f_a = 1$, referred to as the null model, the score statistics can be calculated analytically in the asymptotic regime of long sequences [9,10]. In this case one obtains the Gumbel distribution (Karlin-Altschul statistics) [23]

$$\Pr(S \ge b) = 1 - \exp\left[ -K L M \, e^{-\lambda b} \right]. \tag{3}$$
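As an illustration of the two quantities defined in this section, the following Python sketch scores a given alignment and evaluates the Gumbel tail probability of Eq. (3). It assumes that gap costs are subtracted from the pair scores (consistent with gaps contributing negatively) and that the Gumbel expression has the standard Karlin-Altschul form 1 - exp(-K L M e^{-λb}). The function names and any numerical values of K and λ a caller passes in are hypothetical and not taken from the paper.

```python
import math

def affine_gap_cost(length, alpha, beta):
    """g(l) = alpha + beta*(l-1) for l >= 1, and g(0) = 0."""
    return 0.0 if length == 0 else alpha + beta * (length - 1)

def alignment_score(x, y, pairs, sigma, alpha, beta):
    """Score of a local alignment given as 1-based, non-crossing index pairs.

    Sum of substitution scores over aligned pairs minus the affine cost of
    every gap between consecutive pairs, in x and in y.
    """
    total = sum(sigma(x[i - 1], y[j - 1]) for i, j in pairs)
    for (i0, j0), (i1, j1) in zip(pairs, pairs[1:]):
        total -= affine_gap_cost(i1 - i0 - 1, alpha, beta)  # gap in x
        total -= affine_gap_cost(j1 - j0 - 1, alpha, beta)  # gap in y
    return total

def expected_hits(b, K, lam, L, M):
    """E-value K*L*M*exp(-lam*b) of the Karlin-Altschul form."""
    return K * L * M * math.exp(-lam * b)

def gumbel_p_value(b, K, lam, L, M):
    """P(S >= b) = 1 - exp(-E); expm1 keeps precision when E is tiny."""
    return -math.expm1(-expected_hits(b, K, lam, L, M))
```

Using `math.expm1` instead of `1 - math.exp(...)` matters in the rare-event tail: for very small E-values the naive expression rounds to zero, whereas the p-value should be approximately equal to the E-value.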

Similar resources

Reproductive toxicology. Trichloroethylene.

These centers are: the Agencourt Bioscience Corporation [2] for the Drosophila ananassae, Drosophila mojavensis, and Drosophila virilis sequence data, the Genome Sequencing Center at the Washington University School of Medicine [3] for the Drosophila yakuba, Drosophila simulans, and Caenorhabditis remanei sequence data, the Human Genome Sequencing Center at the Baylor College of Medicine [4] f...

Role of methanogenic and sulfate-reducing bacteria in the reductive dechlorination of tetrachloroethylene in mixed culture.

Several reports have demonstrated biotransformation of tetrachloroethylene at low concentrations under strict anaerobic conditions by sequential reductive dechlorination (Fathepure and Boyd, 1988a; Freedman and Gossett, 1989). During this biodegradation, trichloroethylene (TCE), 1,1-dichloroethylene, vinylidene chloride (DCE), and vinyl chloride (VC) are the intermediate products; ethene or etha...

Inhibition of CYP2E1 reverses CD4+ T-cell alterations in trichloroethylene-treated MRL+/+ mice.

Trichloroethylene is an organic solvent that is primarily used as a degreasing agent for metals. There is increasing evidence in both humans and animal models that trichloroethylene promotes the development of autoimmunity, but little is known about the mechanisms that mediate the effect of trichloroethylene on the immune system. Metabolic activation of trichloroethylene is considered an obliga...

Simulation of the toxicokinetics of trichloroethylene, methylene chloride, styrene and n-hexane by a toxicokinetics/toxicodynamics model using experimental data.

The toxicokinetics/toxicodynamics (TKTD) model simulates the toxicokinetics of a chemical based on physiological data such as blood flow, tissue partition coefficients and metabolism. In this study, Andersen and Clewell's TKTD model was used with seven compartments and ten differential equations for calculating chemical balances in the compartments (Andersen and Clewell 1996, Workshop on physio...

Effects of trichloroethylene and perchloroethylene on wild rodents at Edwards Air Force Base, California, USA.

Effects of inhalation of volatilized trichloroethylene (TCE) or perchloroethylene (PCE) were assessed based on the health and population size of wild, burrowing mammals at Edwards Air Force Base (CA, USA). Organic soil-vapor concentrations were measured at three sites with aquifer contamination of TCE or PCE of 5.5 to 77 mg/L and at two uncontaminated reference sites. Population estimates of ka...

In silico toxicology: simulating interaction thresholds for human exposure to mixtures of trichloroethylene, tetrachloroethylene, and 1,1,1-trichloroethane.

In this study, we integrated our understanding of biochemistry, physiology, and metabolism of three commonly used organic solvents with computer simulation to present a new approach that we call "in silico" toxicology. Thus, we developed an interactive physiologically based pharmacokinetic (PBPK) model to predict the individual kinetics of trichloroethylene (TCE), perchloroethylene (PERC), and ...

Journal title:
  • Environmental Health Perspectives

Volume 105

Publication date: 1997